
OAK-11568 Elastic: improved compatibility for aggregation definitions #2193

Merged · 6 commits · Mar 31, 2025

Conversation

thomasmueller (Member) commented Mar 19, 2025

  • Analyzer configuration is now lenient, quite similar to the Lucene index behavior. This allows converting Lucene indexes to Elasticsearch; warnings are logged where needed.
  • This PR also removes unused code and reduces compiler warnings.
  • The tests in ElasticIndexHelperTest cover failures when trying to load files that are not configured (IllegalStateException, etc.).
  • The tests in FullTextAnalyzerCommonTest cover compatibility problems.
  • With the NGram tokenizer (not the filter), behavior differs between Elastic and Lucene in one case: if the query contains multiple words, Lucene finds the result but Elastic does not.
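The n-gram point above can be illustrated with a small standalone sketch (plain Java, independent of the Lucene and Elastic APIs): an n-gram tokenizer with minGram=2 and maxGram=3 emits all substrings of those lengths from a term, so single-term matching behaves similarly in both engines, while multi-word queries depend on how each engine combines the grams.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // Emit all substrings of length minGram..maxGram, in the same way an
    // n-gram tokenizer splits a single term.
    static List<String> ngrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                grams.add(term.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // With the defaults used in the PR (minGramSize=2, maxGramSize=3):
        System.out.println(ngrams("fox", 2, 3)); // [fo, fox, ox]
    }
}
```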


Commit-Check ✔️

Comment on lines +176 to +184

```java
if ("n_gram".equals(name)) {
    // OAK-11568
    // https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
    Integer minGramSize = getIntegerSetting(args, "minGramSize", 2);
    Integer maxGramSize = getIntegerSetting(args, "maxGramSize", 3);
    TokenizerDefinition ngram = TokenizerDefinition.of(t -> t.ngram(
            NGramTokenizer.of(n -> n.minGram(minGramSize).maxGram(maxGramSize))));
    return ngram;
}
```
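The snippet calls a `getIntegerSetting` helper that is not quoted in this excerpt. A minimal sketch of what such a helper might look like (the signature and fallback behavior here are assumptions, not the PR's actual code):

```java
import java.util.Map;

public class SettingUtils {
    // Hypothetical sketch: read an integer setting from the analyzer args map,
    // falling back to the default when the key is absent or not parsable.
    static Integer getIntegerSetting(Map<String, Object> args, String name, int defaultValue) {
        Object value = args.get(name);
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.toString());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }
}
```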
Contributor:
This is okay for now. We should structure it better to cover all possible tokenizers (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html). That can go in a separate PR.

Member Author:

Yes, I agree!

```java
name = "hyphenation_decompounder";
String hyphenator = args.getOrDefault("hyphenator", "").toString();
LOG.info("Using the hyphenation_decompounder: {}", hyphenator);
args.put("hyphenation_patterns_path", "analysis/hyphenation_patterns.xml");
```
Contributor:

Should "analysis/hyphenation_patterns.xml" be installed on the Elastic nodes?

Member Author:

I wanted to use a fixed name so that it is possible to configure it. Installing this file would have to be done manually, and we need to document that.

Comment on lines +295 to +307
```java
if (skipEntry) {
    continue;
}
String key = name + "_" + i;
filters.put(key, factory.apply(name, JsonData.of(args)));
if (name.equals("word_delimiter_graph")) {
    wordDelimiterFilterKey = key;
} else if (name.equals("synonym")) {
    if (wordDelimiterFilterKey != null) {
        LOG.info("Removing word delimiter because there is a synonyms filter as well: {}", wordDelimiterFilterKey);
        filters.remove(wordDelimiterFilterKey);
    }
}
```
Contributor:

Another option could be to use the multiplexer token filter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-multiplexer-tokenfilter.html). We can work on this in a separate PR.

Member Author:

Yes, I also thought about that. I haven't found good documentation on it yet, though.

@thomasmueller thomasmueller merged commit 25df414 into trunk Mar 31, 2025
5 checks passed